Training versus Testing

Roadmap
(1) When Can Machines Learn?

Lecture 4: Feasibility of Learning
learning is PAC-possible
if enough statistical data and finite $|\mathcal{H}|$

(2) Why Can Machines Learn?

Lecture 5: Training versus Testing
• Recap and Preview
• Effective Number of Lines
• Effective Number of Hypotheses
• Break Point

(3) How Can Machines Learn?
(4) How Can Machines Learn Better?
Training versus Testing Recap and Preview

Recap: the 'Statistical' Learning Flow
if $|\mathcal{H}| = M$ finite, $N$ large enough,
for whatever $g$ picked by $\mathcal{A}$, $E_{\text{out}}(g) \approx E_{\text{in}}(g)$
if $\mathcal{A}$ finds one $g$ with $E_{\text{in}}(g) \approx 0$,
PAC guarantee for $E_{\text{out}}(g) \approx 0$ $\Longrightarrow$ learning possible :-)

[Figure: the learning flow. An unknown target function $f\colon \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula) and an unknown $P$ on $\mathcal{X}$ produce the training examples $\mathcal{D}\colon (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$ (historical records in bank). From the hypothesis set $\mathcal{H}$ (set of candidate formulas) comes the final hypothesis $g \approx f$ ('learned' formula to be used), with $E_{\text{out}}(g)$ (test) $\approx E_{\text{in}}(g)$ (train) $\approx 0$.]

Two Central Questions
for batch & supervised binary classification (lecture 3), $g \approx f$ (lecture 1) $\Longleftrightarrow E_{\text{out}}(g) \approx 0$
achieved through $E_{\text{out}}(g) \approx E_{\text{in}}(g)$ (lecture 4) and $E_{\text{in}}(g) \approx 0$ (lecture 2)

learning split to two central questions:
(1) can we make sure that $E_{\text{out}}(g)$ is close enough to $E_{\text{in}}(g)$?
(2) can we make $E_{\text{in}}(g)$ small enough?

what role does $M$ (that is, $|\mathcal{H}|$) play for the two questions?


Trade-off on M
(1) can we make sure that $E_{\text{out}}(g)$ is close enough to $E_{\text{in}}(g)$?
(2) can we make $E_{\text{in}}(g)$ small enough?

small $M$
(1) Yes!, $\mathbb{P}[\text{BAD}] \le 2 \cdot M \cdot \exp(\ldots)$
(2) No!, too few choices

large $M$
(1) No!, $\mathbb{P}[\text{BAD}] \le 2 \cdot M \cdot \exp(\ldots)$
(2) Yes!, many choices

using the right $M$ (or $\mathcal{H}$) is important
$M = \infty$ doomed?


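Preview

$$\mathbb{P}\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right] \le 2 \cdot M \cdot \exp\left(-2\epsilon^2 N\right)$$

• establish a finite quantity that replaces $M$

$$\mathbb{P}\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right] \overset{?}{\le} 2 \cdot m_{\mathcal{H}} \cdot \exp\left(-2\epsilon^2 N\right)$$

• justify the feasibility of learning for infinite $M$
• study $m_{\mathcal{H}}$ to understand its trade-off for 'right' $\mathcal{H}$, just like $M$

mysterious PLA to be fully resolved
after 3 more lectures :-)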
Fun Time

Data size: how large do we need?

One way to use the inequality

$$\mathbb{P}\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right] \le \underbrace{2 \cdot M \cdot \exp\left(-2\epsilon^2 N\right)}_{\delta}$$

is to pick a tolerable difference $\epsilon$ as well as a tolerable BAD
probability $\delta$, and then gather data with size ($N$) large enough to
achieve those tolerance criteria. Let $\epsilon = 0.1$, $\delta = 0.05$, and $M = 100$.
What is the data size needed?

(1) 215  (2) 415  (3) 615  (4) 815

Reference Answer: (2)

We can simply express $N$ as a function of those 'known' variables.
Then, the needed $N = \frac{1}{2\epsilon^2} \ln \frac{2M}{\delta}$, which is about 414.7 for the values above.
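
The calculation in the reference answer can be sketched in a few lines of Python. This is only an illustration: the function name `sample_size` is an assumption made here, not something from the lecture, and it simply solves $2 M \exp(-2\epsilon^2 N) \le \delta$ for the smallest integer $N$.

```python
import math


def sample_size(epsilon, delta, M):
    """Smallest integer N with 2 * M * exp(-2 * epsilon**2 * N) <= delta."""
    # Rearranging the bound gives N >= ln(2M / delta) / (2 * epsilon**2).
    return math.ceil(math.log(2 * M / delta) / (2 * epsilon ** 2))


# Values from the quiz: epsilon = 0.1, delta = 0.05, M = 100.
print(sample_size(0.1, 0.05, 100))  # 415
```

Because $N$ enters only through the exponent, the required data size grows with $\ln M$ rather than with $M$ itself.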
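Training versus Testing Effective Number of Lines

Where Did M Come From?

$$\mathbb{P}\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right] \le 2 \cdot M \cdot \exp\left(-2\epsilon^2 N\right)$$

• BAD events $\mathcal{B}_m$: $|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon$
• to give $\mathcal{A}$ freedom of choice: bound $\mathbb{P}[\mathcal{B}_1 \text{ or } \mathcal{B}_2 \text{ or } \ldots \text{ or } \mathcal{B}_M]$
• worst case: all $\mathcal{B}_m$ non-overlapping

$$\mathbb{P}[\mathcal{B}_1 \text{ or } \mathcal{B}_2 \text{ or } \ldots \text{ or } \mathcal{B}_M] \le \underbrace{\mathbb{P}[\mathcal{B}_1] + \mathbb{P}[\mathcal{B}_2] + \ldots + \mathbb{P}[\mathcal{B}_M]}_{\text{union bound}}$$

where did union bound fail
to consider for $M = \infty$?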


Where Did Union Bound Fail?
union bound $\mathbb{P}[\mathcal{B}_1] + \mathbb{P}[\mathcal{B}_2] + \ldots + \mathbb{P}[\mathcal{B}_M]$
• BAD events $\mathcal{B}_m$: $|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon$
overlapping for similar hypotheses $h_1 \approx h_2$
• why? (1) $E_{\text{out}}(h_1) \approx E_{\text{out}}(h_2)$
(2) for most $\mathcal{D}$, $E_{\text{in}}(h_1) = E_{\text{in}}(h_2)$
• union bound over-estimating

[Figure: overlapping BAD events.]

to account for overlap,
can we group similar hypotheses by kind?
How Many Lines Are There? (1/2)

$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$

• how many lines? $\infty$
• how many kinds of lines if viewed from one input vector $\mathbf{x}_1$?

2 kinds: $h_1$-like$(\mathbf{x}_1) = \circ$ or $h_2$-like$(\mathbf{x}_1) = \times$


How Many Lines Are There? (2/2)

$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$

• how many kinds of lines if viewed from two inputs $\mathbf{x}_1, \mathbf{x}_2$?

one input: 2; two inputs: 4; three inputs?


How Many Kinds of Lines for Three Inputs? (1/2)

$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$

for three inputs $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$

[Figure: three inputs and the 8 kinds of lines (o/x labelings) they admit.]

always 8 for three inputs?

How Many Kinds of Lines for Three Inputs? (2/2)

$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$

for three inputs $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$

'fewer than 8' when degenerate
(e.g. collinear or same inputs)
How Many Kinds of Lines for Four Inputs?

$\mathcal{H} = \{\text{all lines in } \mathbb{R}^2\}$

for four inputs $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4$

[Figure: 14 kinds, drawn as $2 \times$ the 7 labelings shown.]

for any four inputs
at most 14

Hsuan-Tien Lin (NTU CSIE)
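
The counts on these slides (8 or fewer for three inputs, at most 14 for four) can be checked by brute force. The sketch below is one possible way to do it, not the lecture's method: it sweeps a direction around the circle, and for each direction between two "critical" angles it tries every threshold on the projected inputs. The function name `count_line_dichotomies`, the tolerance `tol`, and the sample coordinates are all assumptions made for illustration.

```python
import math
from itertools import combinations


def count_line_dichotomies(points, tol=1e-9):
    """Count the distinct o/x labelings that some line in R^2 produces on `points`."""
    n = len(points)
    # Critical directions: those along which two distinct inputs project equally.
    critical = set()
    for (x1, y1), (x2, y2) in combinations(points, 2):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0 and dy == 0:
            continue  # identical inputs never change the projection order
        base = (math.atan2(dy, dx) + math.pi / 2) % math.pi
        critical.add(round(base, 12))
        critical.add(round(base + math.pi, 12))
    angles = sorted(critical)
    if angles:
        # One probe direction strictly between each pair of neighbouring critical angles.
        probes = [(a + b) / 2 for a, b in zip(angles, angles[1:])]
        probes.append((angles[-1] + angles[0] + 2 * math.pi) / 2)
    else:
        probes = [0.0, math.pi]

    dichotomies = set()
    for theta in probes:
        wx, wy = math.cos(theta), math.sin(theta)
        order = sorted(range(n), key=lambda i: wx * points[i][0] + wy * points[i][1])
        proj = [wx * points[i][0] + wy * points[i][1] for i in order]
        for k in range(n + 1):
            # A threshold fits only between two clearly different projections.
            if 0 < k < n and proj[k] - proj[k - 1] <= tol:
                continue
            labels = [+1] * n
            for i in order[:k]:
                labels[i] = -1
            dichotomies.add(tuple(labels))
    return len(dichotomies)


print(count_line_dichotomies([(0, 0), (1, 0), (0, 1)]))          # three inputs, general position
print(count_line_dichotomies([(0, 0), (1, 1), (2, 2)]))          # three collinear inputs
print(count_line_dichotomies([(0, 0), (2, 0), (2, 1), (0, 3)]))  # four inputs
```

With these sample inputs the three calls should report 8, 6 and 14, in line with the slides; other input placements can be tried to see that four inputs never exceed 14.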
Effective Number of Lines

maximum kinds of lines with respect to $N$ inputs $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$
$\Longleftrightarrow$ effective number of lines

• must be $\le 2^N$ (why?)
• finite 'grouping' of infinitely-many lines $\in \mathcal{H}$
• wish:

$$\mathbb{P}\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right] \le 2 \cdot \text{effective}(N) \cdot \exp\left(-2\epsilon^2 N\right)$$

N    effective(N)
1    2
2    4
3    8
4    14 (< 2^N)

if (1) effective($N$) can replace $M$ and
(2) effective($N$) $\ll 2^N$
learning possible with infinite lines :-)
Fun Time

What is the effective number of lines for five inputs $\in \mathbb{R}^2$?

(1) 14  (2) 16  (3) 22  (4) 32

Reference Answer: (3)

If you put the inputs roughly around a circle,
you can then pick any consecutive inputs to be
on one side of the line, and the other inputs to
be on the other side. The procedure leads to
effectively 22 kinds of lines, which is much
smaller than $2^5 = 32$. You will find it difficult
to generate more kinds by varying the inputs,
and we will give a formal proof in future
lectures.
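Training versus Testing Effective Number of Hypotheses

Dichotomies: Mini-hypotheses

$\mathcal{H} = \{\text{hypothesis } h\colon \mathcal{X} \to \{\times, \circ\}\}$

• call

$$h(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N) = \left(h(\mathbf{x}_1), h(\mathbf{x}_2), \ldots, h(\mathbf{x}_N)\right) \in \{\times, \circ\}^N$$

a dichotomy: hypothesis 'limited' to the eyes of $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$
• $\mathcal{H}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)$:
all dichotomies 'implemented' by $\mathcal{H}$ on $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$

        hypotheses H          | dichotomies H(x1, x2, ..., xN)
e.g.    all lines in R^2      | {oooo, ooox, ooxx, ...}
size    possibly infinite     | upper bounded by 2^N

$|\mathcal{H}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)|$: candidate for replacing $M$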
Growth Function

• $|\mathcal{H}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)|$: depend on inputs
$(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)$
• growth function:
remove dependence by taking max of all
possible $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)$

$$m_{\mathcal{H}}(N) = \max_{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N \in \mathcal{X}} |\mathcal{H}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)|$$

• finite, upper-bounded by $2^N$

how to 'calculate' the growth function?
Growth Function for Positive Rays

• $\mathcal{X} = \mathbb{R}$ (one dimensional)
• $\mathcal{H}$ contains $h$, where each $h(x) = \text{sign}(x - a)$ for threshold $a$
• 'positive half' of 1D perceptrons

one dichotomy for $a \in$ each spot $(x_n, x_{n+1})$:
$m_{\mathcal{H}}(N) = N + 1$

$(N + 1) \ll 2^N$ when $N$ large!





Growth Function for Positive Intervals

[Figure: inputs $x_1, x_2, x_3, \ldots, x_N$ on a line; $h(x) = +1$ inside an interval and $h(x) = -1$ on both sides of it.]

• $\mathcal{X} = \mathbb{R}$ (one dimensional)
• $\mathcal{H}$ contains $h$, where each $h(x) = +1$ iff $x \in [\ell, r)$, $-1$ otherwise

one dichotomy for each 'interval kind'

$$m_{\mathcal{H}}(N) = \underbrace{\binom{N+1}{2}}_{\text{interval ends in } N+1 \text{ spots}} + \underbrace{1}_{\text{all } \times} = \frac{1}{2}N^2 + \frac{1}{2}N + 1$$

$\left(\frac{1}{2}N^2 + \frac{1}{2}N + 1\right) \ll 2^N$ when $N$ large!
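
Both closed forms, $N + 1$ for positive rays and $\frac{1}{2}N^2 + \frac{1}{2}N + 1$ for positive intervals, can be checked by enumerating dichotomies directly. The following is a minimal sketch under the assumption that the inputs are distinct reals; the helper names `spots`, `ray_dichotomies` and `interval_dichotomies` are made up for this example.

```python
def spots(xs):
    """One representative value in each of the N + 1 gaps around the sorted inputs."""
    pts = sorted(xs)
    mids = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    return [pts[0] - 1.0] + mids + [pts[-1] + 1.0]


def ray_dichotomies(xs):
    # Positive ray: h(x) = +1 iff x > a, i.e. sign(x - a) with a never equal to an input.
    return {tuple(+1 if x > a else -1 for x in xs) for a in spots(xs)}


def interval_dichotomies(xs):
    # Positive interval: h(x) = +1 iff l <= x < r, with both ends taken from the spots.
    result = set()
    for left in spots(xs):
        for right in spots(xs):
            if left <= right:
                result.add(tuple(+1 if left <= x < right else -1 for x in xs))
    return result


for n in range(1, 8):
    xs = [float(i) for i in range(n)]  # any N distinct reals give the same counts
    assert len(ray_dichotomies(xs)) == n + 1
    assert len(interval_dichotomies(xs)) == n * (n + 1) // 2 + 1
print("closed forms match for N = 1..7")
```

Choosing both interval ends in the same spot yields the all-x dichotomy, which is the "+1" term in the formula.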





Growth Function for Convex Sets (1/2)

[Figure: a convex region in blue and a non-convex region.]

• $\mathcal{X} = \mathbb{R}^2$ (two dimensional)
• $\mathcal{H}$ contains $h$, where $h(\mathbf{x}) = +1$ iff $\mathbf{x}$ in a
convex region, $-1$ otherwise

what is $m_{\mathcal{H}}(N)$?
Growth Function for Convex Sets (2/2)

• one possible set of $N$ inputs:
$\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ on a big circle
• every dichotomy can be implemented
by $\mathcal{H}$ using a convex region slightly
extended from contour of positive inputs

$m_{\mathcal{H}}(N) = 2^N$

• call those $N$ inputs 'shattered' by $\mathcal{H}$

$m_{\mathcal{H}}(N) = 2^N \Longleftrightarrow$
exists $N$ inputs that can be shattered


Fun Time

Consider positive and negative rays as $\mathcal{H}$, which is equivalent
to the perceptron hypothesis set in 1D. The hypothesis set is
often called 'decision stump' to describe the shape of its
hypotheses. What is the growth function $m_{\mathcal{H}}(N)$?

(1) $N$  (2) $N + 1$  (3) $2N$  (4) $2^N$

Reference Answer: (3)

Two dichotomies when threshold in each of the
$N - 1$ 'internal' spots; two dichotomies for the
all-o and all-x cases.
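
The count of $2N$ can also be confirmed by enumeration. This is a small sketch, assuming distinct real inputs; `stump_dichotomies` is a name invented here, and the direction `s` stands for choosing a positive or a negative ray.

```python
def stump_dichotomies(xs):
    """All labelings of xs produced by h(x) = s * sign(x - a), with s in {+1, -1}."""
    pts = sorted(xs)
    # One threshold per gap: left of all inputs, between neighbours, right of all inputs.
    thresholds = [pts[0] - 1.0]
    thresholds += [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    thresholds += [pts[-1] + 1.0]
    return {
        tuple(s if x > a else -s for x in xs)
        for a in thresholds
        for s in (+1, -1)
    }


for n in range(1, 9):
    xs = [float(i) for i in range(n)]
    # 2 * (N + 1) (threshold, direction) pairs, minus the doubly counted all-o and all-x.
    assert len(stump_dichotomies(xs)) == 2 * n
print("m_H(N) = 2N holds for N = 1..8")
```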
Training versus Testing Break Point

The Four Growth Functions

• positive rays: $m_{\mathcal{H}}(N) = N + 1$
• positive intervals: $m_{\mathcal{H}}(N) = \frac{1}{2}N^2 + \frac{1}{2}N + 1$
• convex sets: $m_{\mathcal{H}}(N) = 2^N$
• 2D perceptrons: $m_{\mathcal{H}}(N) < 2^N$ in some cases

what if $m_{\mathcal{H}}(N)$ replaces $M$?

$$\mathbb{P}\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right] \overset{?}{\le} 2 \cdot m_{\mathcal{H}}(N) \cdot \exp\left(-2\epsilon^2 N\right)$$

polynomial: good; exponential: bad

for 2D or general perceptrons,
$m_{\mathcal{H}}(N)$ polynomial?
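
To see why "polynomial: good; exponential: bad", one can evaluate the hoped-for right-hand side $2 \cdot m_{\mathcal{H}}(N) \cdot \exp(-2\epsilon^2 N)$ for the growth functions listed above. The sketch below works with the logarithm of that expression to avoid overflow for $2^N$; the name `log_bound` and the choice $\epsilon = 0.1$ are assumptions for illustration, and the substitution of $m_{\mathcal{H}}(N)$ for $M$ is still only a "what if" at this point in the lecture.

```python
import math


def log_bound(log_growth, n, epsilon):
    """Natural log of 2 * m_H(N) * exp(-2 * epsilon**2 * N), given log m_H(N)."""
    return math.log(2) + log_growth - 2 * epsilon ** 2 * n


epsilon = 0.1
for n in (100, 1000, 10000):
    rays = log_bound(math.log(n + 1), n, epsilon)
    intervals = log_bound(math.log(n * (n + 1) // 2 + 1), n, epsilon)
    convex = log_bound(n * math.log(2), n, epsilon)
    print(f"N={n:6d}  rays: {rays:10.1f}  intervals: {intervals:10.1f}  convex sets: {convex:10.1f}")
```

A negative value that keeps decreasing means the expression shrinks toward 0 as $N$ grows, which is what happens for the two polynomial growth functions; for $2^N$ with this $\epsilon$ the value keeps increasing, so the expression says nothing useful.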
Break Point of H

what do we know about 2D perceptrons now?

three inputs: 'exists' shatter;
four inputs: 'for all' no shatter

if no $k$ inputs can be shattered by $\mathcal{H}$,
call $k$ a break point for $\mathcal{H}$

• $m_{\mathcal{H}}(k) < 2^k$
• $k + 1, k + 2, k + 3, \ldots$ also break points!
• will study minimum break point $k$

2D perceptrons: break point at 4
The Four Break Points

• positive rays: $m_{\mathcal{H}}(N) = N + 1 = O(N)$
break point at 2
• positive intervals: $m_{\mathcal{H}}(N) = \frac{1}{2}N^2 + \frac{1}{2}N + 1 = O(N^2)$
break point at 3
• convex sets: $m_{\mathcal{H}}(N) = 2^N$
no break point
• 2D perceptrons: $m_{\mathcal{H}}(N) < 2^N$ in some cases
break point at 4

conjecture:
• no break point: $m_{\mathcal{H}}(N) = 2^N$ (sure!)
• break point $k$: $m_{\mathcal{H}}(N) = O(N^{k-1})$
excited? wait for next lecture :-)





Fun Time

Consider positive and negative rays as $\mathcal{H}$, which is equivalent
to the perceptron hypothesis set in 1D. As discussed in an
earlier quiz question, the growth function $m_{\mathcal{H}}(N) = 2N$. What is
the minimum break point for $\mathcal{H}$?

(1) 1  (2) 2  (3) 3  (4) 4

Reference Answer: (3)

At $k = 2$, $m_{\mathcal{H}}(k) = 4 = 2^k$, so 2 is not a break point.
At $k = 3$, $m_{\mathcal{H}}(k) = 6$ while $2^k = 8$.
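
Given a closed-form growth function, the minimum break point is just the first $k$ with $m_{\mathcal{H}}(k) < 2^k$. The sketch below searches for it over a finite range; the function name `min_break_point` and the cut-off `max_k` are assumptions made for this example, and a result of `None` only means that no break point was found within that range.

```python
def min_break_point(growth, max_k=30):
    """First k in 1..max_k with growth(k) < 2**k, or None if there is none in range."""
    for k in range(1, max_k + 1):
        if growth(k) < 2 ** k:
            return k
    return None


# Growth functions stated in the lecture (2D perceptrons omitted: no closed form given yet).
growth_functions = {
    "positive rays": lambda n: n + 1,
    "positive intervals": lambda n: n * (n + 1) // 2 + 1,
    "convex sets": lambda n: 2 ** n,
    "positive and negative rays": lambda n: 2 * n,
}

for name, growth in growth_functions.items():
    print(f"{name}: {min_break_point(growth)}")
```

The search agrees with the slides: 2 for positive rays, 3 for positive intervals, 3 for positive and negative rays, and nothing found for convex sets.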
Summary
(1) When Can Machines Learn?

Lecture 4: Feasibility of Learning

(2) Why Can Machines Learn?

Lecture 5: Training versus Testing
• Recap and Preview
two questions: $E_{\text{out}}(g) \approx E_{\text{in}}(g)$, and $E_{\text{in}}(g) \approx 0$
• Effective Number of Lines
at most 14 through the eye of 4 inputs
• Effective Number of Hypotheses
at most $m_{\mathcal{H}}(N)$ through the eye of $N$ inputs
• Break Point
when $m_{\mathcal{H}}(N)$ becomes 'non-exponential'

• next: $m_{\mathcal{H}}(N) = \text{poly}(N)$?

(3) How Can Machines Learn?
(4) How Can Machines Learn Better?